In economics and data science, we often work with tabular data, whether it comes from:
Surveys and administrative records
Macroeconomic databases (e.g., World Bank, IMF)
Financial market data feeds
Census and demographic data
Firm-level accounting statements
While NumPy arrays are highly efficient for numerical operations, they lack some of the conveniences required to handle real-world economic data:
Row and column labels (e.g., country names, dates)
Mixed data types (e.g., numeric GDP values with categorical country codes)
Missing values (e.g., unreported statistics for some countries/years)
Hierarchical relationships (e.g., firms nested within industries within countries)
This is where pandas excels. Built on top of NumPy, pandas provides flexible, Pythonic data structures—namely Series (1D) and DataFrames (2D)—that support common operations like data cleaning, subsetting, merging, and reshaping.
The DataFrame in pandas is the closest analog to what economists and social scientists might see in Stata, R, or Excel: rows representing observations, columns representing variables, and the ability to label these axes in intuitive ways. This chapter will guide you through:
Data Structures: Understanding Series and DataFrames
Data Loading: Reading from files and databases
Data Cleaning: Handling missing values and data types
Data Manipulation: Filtering, sorting, and transforming
Data Analysis: Computing statistics and aggregations
Data Export: Saving results in various formats
These skills are crucial for real-world applications in economics and machine learning, where data rarely comes in a perfectly clean and ready-to-use format.
16.2 Pandas Series: The Building Block of DataFrames
Before diving deeper into DataFrames, it’s crucial to understand Series, the fundamental building block of pandas. A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.). In economic data analysis, individual Series often represent:
Time series of a single variable (e.g., monthly GDP values)
Cross-sectional observations (e.g., population across countries)
Categories or classifications (e.g., industry sectors)
Think of a Series as analogous to a single column in Stata or a vector in R, but with added capabilities for handling missing data and maintaining row labels (index). This labeling is particularly valuable when working with economic data where observations often have meaningful identifiers like country names, dates, or industry codes.
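For example, a labeled Series can be indexed directly by its labels (the numbers here are illustrative):
# Population in millions, indexed by country
pop = pd.Series([332, 67, 126], index=['USA', 'UK', 'Japan'])
print(pop['UK'])              # access a single value by label
print(pop[['USA', 'Japan']])  # select several labels at once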
16.2.1 Relationship with NumPy
Since pandas is built on top of NumPy, there’s a close relationship between pandas Series/DataFrames and NumPy arrays:
import numpy as np
import pandas as pd

# Convert NumPy array to Series
arr = np.array([1, 2, 3])
s = pd.Series(arr)

# Convert Series to NumPy array
arr_back = s.to_numpy()

# Convert 2D NumPy array to DataFrame
arr_2d = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
df = pd.DataFrame(
    arr_2d,
    columns=['A', 'B', 'C']
)

# Convert DataFrame to 2D NumPy array
arr_2d_back = df.to_numpy()

# NumPy functions work on Series/DataFrames
# Element-wise square root of Series
print(np.sqrt(s))

# Mean of each DataFrame column
print(np.mean(df))

# Exponential of one column
print(np.exp(df['A']))

# Universal functions (ufuncs) work too
# Element-wise exponential
print(np.exp(s))
While pandas builds on NumPy’s efficient array operations, there are several key differences that make pandas Series more flexible for data analysis. First, Series have an index that labels each position in the array, allowing for intuitive data access using meaningful identifiers rather than just integer positions. This is particularly useful when working with time series data or when rows represent distinct entities like countries or companies. Additionally, Series preserve pandas-specific features that aren’t available in NumPy arrays, such as sophisticated missing value handling.
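A quick sketch of the missing-value difference (illustrative values):
import numpy as np
import pandas as pd

arr = np.array([1.0, np.nan, 3.0])
s = pd.Series(arr)
print(arr.mean())   # nan: NumPy propagates the missing value
print(s.mean())     # 2.0: pandas skips NaN by default (skipna=True)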
16.2.2 Creating Series
You can create a Series from a list, NumPy array, or dictionary:
import pandas as pd
import numpy as np

# From a list
s1 = pd.Series(
    [10, 20, 30, 40]
)
print(s1)
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64

# From a NumPy array
s2 = pd.Series(
    np.array([1, 3, 5, 7])
)
print(s2)
# 0    1
# 1    3
# 2    5
# 3    7
# dtype: int64

# From a dictionary (keys become index)
s3 = pd.Series({'a': 100, 'b': 200, 'c': 300})
print(s3)
# a    100
# b    200
# c    300
# dtype: int64
16.3 Pandas DataFrames
A DataFrame is essentially a collection of Series objects that share the same index. Each column in a DataFrame is a Series. This relationship is crucial for understanding how to extract and manipulate data within a DataFrame.
16.3.1 Creating a DataFrame by combining Series
You can create a DataFrame by combining multiple Series objects. The Series must share the same index for proper alignment:
# Create individual Series
gdp = pd.Series(
    [19.4, 3.0, 4.9],
    index=['USA', 'UK', 'Japan']
)
pop = pd.Series(
    [332, 67, 126],
    index=['USA', 'UK', 'Japan']
)

# Combine into DataFrame
df = pd.DataFrame({
    'GDP_trillion': gdp,
    'Population_million': pop
})
print(df)
#        GDP_trillion  Population_million
# USA            19.4               332.0
# UK              3.0                67.0
# Japan           4.9               126.0
The resulting DataFrame uses the Series’ shared index as row labels and the dictionary keys as column names.
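If the indices do not match exactly, pandas aligns the Series on the union of their labels and fills the gaps with NaN. A small sketch with illustrative values:
gdp2 = pd.Series([19.4, 3.0], index=['USA', 'UK'])
pop2 = pd.Series([67, 126], index=['UK', 'Japan'])
df2 = pd.DataFrame({'GDP_trillion': gdp2, 'Population_million': pop2})
print(df2)
# The rows are the union of the two indices; Japan has no GDP value and
# USA has no population value, so those cells become NaN.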
Later we will see how to import data from files into DataFrames.
16.3.2 Extracting a Series from a DataFrame
You can extract a column (which is a Series) from a DataFrame using bracket notation:
# Assuming df is a DataFrame with a 'GDP' column
gdp_series = df["GDP"]
print(type(gdp_series))
# <class 'pandas.core.series.Series'>
16.3.3 Adding a Series to a DataFrame
You can add a Series as a new column to a DataFrame:
# Create a new Series
new_series = pd.Series(
    [100, 200, 300],
    index=['USA', 'UK', 'Japan']
)

# Add it as a new column
df["New_Column"] = new_series
print(df)
#        GDP  New_Column
# USA    100         100
# UK     200         200
# Japan  300         300
16.4 Understanding Row Identification in pandas
Before diving into specific operations, it’s important to understand how pandas identifies and organizes rows in a DataFrame. This conceptual foundation is crucial for effectively working with data, especially when performing operations like filtering, grouping, or panel data analysis.
16.4.1 Index Types
Default Index
When you create a DataFrame without specifying an index, pandas assigns a default integer index (0, 1, 2, …).
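For example:
df = pd.DataFrame({'GDP': [100, 200, 300]})
print(df)
#    GDP
# 0  100
# 1  200
# 2  300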
Label Index
You can assign meaningful labels to rows using any hashable type (strings, integers, dates).
Common in economics for time series or cross-sectional data.
df.index = ['USA', 'UK', 'Japan']
print(df)
#        GDP
# USA    100
# UK     200
# Japan  300
Multi-Level (Hierarchical) Index
Rows can be identified by multiple levels of labels.
Essential for panel data and nested categorizations.
# Create DataFrame with multi-level index
df = pd.DataFrame({
    'GDP': [100, 120, 200, 220],
    'Year': [2020, 2021, 2020, 2021],
    'Country': ['USA', 'USA', 'UK', 'UK']
})

# Set multi-level index
df_panel = df.set_index(
    ['Country', 'Year']
)
print(df_panel)
#               GDP
# Country Year
# USA     2020  100
#         2021  120
# UK      2020  200
#         2021  220
16.4.2 Working with Indices
Setting and Resetting
# Set a single column as index
df = df.set_index('date')

# Set multiple columns as index levels
df = df.set_index(['country', 'year'])

# Convert index back to regular columns
df = df.reset_index()
Index Operations
# Check index properties
print(df.index.names)   # Names of index levels
print(df.index.levels)  # Values in each level

# Sort by index
df = df.sort_index()
Accessing Data
Different index types enable various data access patterns:
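# Single-level index
gdp_usa = df.loc['USA']

# Multi-level index (df_panel from the example above)
gdp_usa_2020 = df_panel.loc[('USA', 2020)]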
Understanding these index concepts is fundamental to effective data analysis in pandas. The index system determines how you select and filter data, enables powerful grouping and aggregation operations, and provides natural representation for hierarchical data structures. This flexibility is particularly important when working with time series, where dates serve as indices, or when dealing with panel data that requires multiple levels of identification.
The hierarchical indexing capabilities of pandas make it especially powerful for handling complex data structures. Whether you’re analyzing panel data with country-year observations, studying nested relationships across industries and regions over time, or conducting cross-sectional studies with multiple categorical dimensions, the index system provides the foundation for efficient data organization and analysis.
16.5 Loading and Inspecting Datasets
Real-world data often exists in CSV files, Excel spreadsheets, or databases. pandas provides convenient functions to load data from these sources. For example, we load a dataset in CSV format which contains details of athletes who participated in a particular year's Olympic games.
import pandas as pd

# Reading a CSV file
df = pd.read_csv("../data/athletes.csv")

# Peek at the first few rows
print(df.head())

# Get DataFrame shape (rows, columns)
print(df.shape)

# Check column names
print(df.columns)
id name nationality sex dob height weight \
0 736041664 A Jesus Garcia ESP male 10/17/69 1.72 64.0
1 532037425 A Lam Shin KOR female 9/23/86 1.68 56.0
2 435962603 Aaron Brown CAN male 5/27/92 1.98 79.0
3 521041435 Aaron Cook MDA male 1/2/91 1.83 80.0
4 33922579 Aaron Gate NZL male 11/26/90 1.81 71.0
sport gold silver bronze
0 athletics 0 0 0
1 fencing 0 0 0
2 athletics 0 0 1
3 taekwondo 0 0 0
4 cycling 0 0 0
(11538, 11)
Index(['id', 'name', 'nationality', 'sex', 'dob', 'height', 'weight', 'sport',
'gold', 'silver', 'bronze'],
dtype='object')
pd.read_csv(): Reads comma-separated values (CSV). You can customize delimiters (sep='\t' for TSV files), specify headers, handle missing values, and more via additional parameters (see the sketch after this list).
df.head(): Shows the first 5 rows (or another number you specify).
df.shape: Returns a tuple (rows, columns).
df.columns: Lists the column names, which is helpful to quickly see the variables available in the dataset.
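For instance, here is a minimal sketch of customizing the read; the tab-separated file name and the missing-value markers are hypothetical:
# Hypothetical example: reading a tab-separated copy of the athletes data
df_tsv = pd.read_csv(
    "../data/athletes.tsv",   # hypothetical TSV version of the file
    sep="\t",                 # treat tabs as the delimiter
    na_values=["NA", ""],     # strings to interpret as missing values
    parse_dates=["dob"],      # parse this column as dates while reading
)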
You can also load Excel files with pd.read_excel("file.xlsx"), JSON data with pd.read_json("file.json"), and more.
If you’d like to work along with the examples in this chapter, you can download the dataset from https://mlbook.jyotirmoy.net/static/data/athletes.csv. Adjust the path in the read_csv call to the location where you saved the file.
16.5.1 Examining Continuous Columns
Many datasets have both continuous (height, weight) and categorical (e.g., sport, nationality) columns. You can quickly describe numeric and non-numeric data:
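# Summary of numeric columns
print(df.describe())

The output shows summary statistics for the numeric columns:
count: Number of non-null (non-missing) values for each column
mean: Average value
std: Standard deviation
min: Minimum value
25%: First quartile (25th percentile)
50%: Median (50th percentile)
75%: Third quartile (75th percentile)
max: Maximum value
16.5.2 Examining Categorical Columns
# Summary of categorical (non-numeric) columns
print(df.describe(include=["object", "category"]))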
name nationality sex dob sport
count 11538 11538 11538 11537 11538
unique 11517 207 2 5595 28
top Gabriella Szucs USA male 2/18/93 athletics
freq 2 567 6333 9 2363
For categorical data, .describe() shows count, unique category count, top category, and frequency of the top category.
16.6 Data Types (dtypes)
Each column in a DataFrame has an associated data type, which can significantly impact memory usage and performance. Common dtypes include float, int, bool, datetime, and object (for strings or mixed content).
print(df.dtypes)
id int64
name object
nationality object
sex object
dob object
height float64
weight float64
sport object
gold int64
silver int64
bronze int64
dtype: object
This reveals how pandas interprets each column. If a particular column (like a categorical variable or date) is not recognized correctly, you can change its type.
When reading CSV files, string columns are automatically stored as ‘object’ dtype, which is pandas’ way of handling Python strings and mixed data types. While flexible, object dtypes consume more memory because each value is stored as a separate Python object with its own overhead.
For columns with a limited set of unique values (like ‘sex’, ‘nationality’), converting to ‘category’ dtype can significantly reduce memory usage and improve performance:
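# Convert string columns to categorical
df['sex'] = df['sex'].astype('category')
df['nationality'] = df['nationality'].astype('category')
df['sport'] = df['sport'].astype('category')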
The dob column was not correctly recognized as a date. Let's convert it too:
# Convert date columns from object to datetime
# For example, if the 'dob' column contains dates like '01/15/95':
df['dob'] = pd.to_datetime(
    df['dob'],
    format='%m/%d/%y'
)
# Now df['dob'] is datetime64[ns] type instead of object
16.7 Cross-tabulation
Cross-tabulation (or contingency tables) is a method to analyze the relationship between categorical variables by displaying their frequency distributions in a table format. In economics, this is particularly useful for:
Analyzing employment status across education levels
Studying industry distribution across regions
Examining trade relationships between countries
Understanding market segment distributions
pandas provides the pd.crosstab() function for creating these tables:
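# Basic cross-tabulation
pd.crosstab(
    df['nationality'],   # Row variable
    df['sex']            # Column variable
)

# With margins (row/column totals)
pd.crosstab(
    df['nationality'],
    df['sex'],
    margins=True   # Add row/column totals
)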
# Normalize by rows (proportions)
pd.crosstab(
    df['nationality'],
    df['sex'],
    normalize='index'   # Row percentages
)
sex            female      male
nationality
AFG          0.333333  0.666667
ALB          0.500000  0.500000
ALG          0.147059  0.852941
AND          0.600000  0.400000
ANG          0.692308  0.307692
...               ...       ...
VIE          0.565217  0.434783
VIN          0.500000  0.500000
YEM          0.333333  0.666667
ZAM          0.285714  0.714286
ZIM          0.742857  0.257143

207 rows × 2 columns
# Normalize by columns (proportions)
pd.crosstab(
    df['nationality'],
    df['sex'],
    normalize='columns'   # Column percentages
)
# Multiple row variables by
# providing a list of columns
# as the rows
pd.crosstab(
    [df['nationality'], df['sport']],
    df['sex']
)
sex                        female  male
nationality sport
AFG         athletics           1     1
            judo                0     1
ALB         aquatics            1     1
            athletics           1     1
            weightlifting       1     1
...                           ...   ...
ZIM         athletics           1     5
            equestrian          1     0
            football           22     0
            rowing              1     1
            shooting            0     1

1776 rows × 2 columns
We can similarly have multiple columns to create a hierarchical table.
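For example, passing a list of columns as the column argument yields a hierarchical column index; a small sketch using the same athletes data:
# Two column variables: sex and sport become two column levels
pd.crosstab(
    df['nationality'],
    [df['sex'], df['sport']]
)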
The aggfunc parameter in crosstab determines how to combine multiple values that fall into the same category combinations. For example, when analyzing company data by Industry and Region:
sum: Adds up all values (e.g., total revenue for each Industry-Region pair)
mean: Calculates average (e.g., average company size per Industry-Region)
count: Counts occurrences (e.g., number of companies in each Industry-Region)
min/max: Finds extreme values (e.g., largest company in each Industry-Region)
Custom functions can also be used for more complex aggregations.
Suppose that for each country we want the average height of men and women.
# With aggregation function
pd.crosstab(
    df['nationality'],
    df['sex'],
    values=df['height'],   # Values to aggregate
    aggfunc='mean'         # Aggregation function
)
sex            female      male
nationality
AFG          1.650000  1.780000
ALB          1.663333  1.860000
ALG          1.664000  1.772241
AND          1.676667  1.765000
ANG          1.754444  1.780000
...               ...       ...
VIE          1.643846  1.690000
VIN          1.780000  1.730000
YEM          1.660000  1.715000
ZAM          1.760000  1.770000
ZIM          1.643462  1.770000

202 rows × 2 columns
The aggfunc parameter determines how values are combined when multiple rows map to the same Industry-Region combination. For example, if you have multiple companies in the same industry and region, aggfunc='sum' will add their revenues together. Other common options include:
‘mean’: Average value (e.g., average revenue per industry-region)
‘count’: Number of occurrences
‘min’/‘max’: Extreme values
Custom functions can also be used for more complex aggregations
16.8 Computing Aggregates and Summary Statistics
Summaries and group-level statistics are integral to data analysis in economics. With pandas, you can compute aggregates across the entire dataset or grouped by certain columns. These operations are essential for understanding data distributions, identifying patterns, and preparing summary tables for research papers.
16.8.1 Simple Aggregations
# Basic statistics
print("Mean:", df["height"].mean())
print("Median:", df["height"].median())
print("Standard deviation:", df["height"].std())
print("Variance:", df["height"].var())

# Range statistics
print("Min:", df["height"].min())
print("Max:", df["height"].max())
print("First quartile:", df["height"].quantile(0.25))
print()
print("Multiple quartiles:")
print(df["height"].quantile([0.25, 0.5, 0.75]))
print()

# Counts
print("No. of non-missing values: ", df["nationality"].count())
print("No. of unique values: ", df["nationality"].nunique())
Mean: 1.766282119914347
Median: 1.76
Standard deviation: 0.11271869117624804
Variance: 0.012705503340486377
Min: 1.21
Max: 2.21
First quartile: 1.69
Multiple quartiles:
0.25 1.69
0.50 1.76
0.75 1.84
Name: height, dtype: float64
No. of non-missing values: 11538
No. of unique values:  207
You can apply these directly to columns or the entire DataFrame (if all columns are numeric).
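For instance, a small sketch of aggregating several columns at once:
# Column-wise means for a few selected columns
print(df[["height", "weight"]].mean())

# Or restrict to the numeric columns first, then aggregate the whole selection
print(df.select_dtypes("number").mean())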
16.8.2 Grouped Aggregations
The .groupby() function is pandas’ primary tool for grouped aggregations. It lets you:
Split the data into groups
Apply operations to each group independently
Combine the results into a new DataFrame
When you perform a groupby operation, pandas automatically creates an index from your grouping columns, following the index concepts introduced earlier:
# Group by country and compute mean height
# Result is a new Series indexed by nationality
height_by_nat = df.groupby("nationality")["height"].mean()
print(height_by_nat)
nationality
AFG 1.736667
ALB 1.761667
ALG 1.756324
AND 1.712000
ANG 1.762308
...
VIE 1.663913
VIN 1.755000
YEM 1.696667
ZAM 1.766667
ZIM 1.676000
Name: height, Length: 207, dtype: float64
# Multiple aggregations for one column
# Result has country as index, aggregation functions as columns
height_by_nat = df.groupby("nationality")["height"].agg(["mean", "std"])
print(height_by_nat)
mean std
nationality
AFG 1.736667 0.080829
ALB 1.761667 0.157913
ALG 1.756324 0.085935
AND 1.712000 0.054498
ANG 1.762308 0.065685
... ... ...
VIE 1.663913 0.073407
VIN 1.755000 0.035355
YEM 1.696667 0.040415
ZAM 1.766667 0.069186
ZIM 1.676000 0.093971
[207 rows x 2 columns]
You can group by multiple columns to create a hierarchical structure:
# Group by nationality and sport,
# aggregating multiple columns
summaries = df.groupby(
    ["nationality", "sport"]
).agg({
    "height": ["mean", "std"],
    "weight": ["mean", "std"]
})
print(summaries)
height weight
mean std mean std
nationality sport
AFG aquatics NaN NaN NaN NaN
archery NaN NaN NaN NaN
athletics 1.7 0.070711 61.5 9.192388
badminton NaN NaN NaN NaN
basketball NaN NaN NaN NaN
... ... ... ... ...
ZIM tennis NaN NaN NaN NaN
triathlon NaN NaN NaN NaN
volleyball NaN NaN NaN NaN
weightlifting NaN NaN NaN NaN
wrestling NaN NaN NaN NaN
[5796 rows x 4 columns]
When grouping by multiple columns like this, pandas creates a hierarchical index (also called a multi-level index). This structure is particularly valuable in economics because it allows you to analyze data at different levels of granularity.
16.8.3 Working with Multi-Level Indices
Multi-level indices are powerful tools for analyzing hierarchically structured data. They’re particularly valuable in economics for analyzing nested relationships like country-year observations or sector-region combinations. Here’s how to work with them effectively:
Accessing Data: There are several ways to select data from a multi-level index:
# Get all data for one country (returns all sports for USA)
print(summaries.loc["USA"])
# Get data for a specific nation-sport pair using a tuple
print(summaries.loc[("USA", "tennis")])
height  mean     1.820000
        std      0.077460
weight  mean    73.800000
        std      7.420692
Name: (USA, tennis), dtype: float64
# Cross-section: get all football players
# .xs() is especially useful for selecting from inner levels
print(summaries.xs("football", level="sport"))
height weight
mean std mean std
nationality
AFG NaN NaN NaN NaN
ALB NaN NaN NaN NaN
ALG 1.771818 0.089954 69.363636 6.245345
AND NaN NaN NaN NaN
ANG NaN NaN NaN NaN
... ... ... ... ...
VIE NaN NaN NaN NaN
VIN NaN NaN NaN NaN
YEM NaN NaN NaN NaN
ZAM NaN NaN NaN NaN
ZIM 1.639091 0.055370 58.409091 5.670803
[207 rows x 4 columns]
Index Operations: You can reshape your data structure:
# Convert hierarchical index into regular columns
# Useful when you need a flat table format
df_flat = summaries.reset_index()

# Change the order of index levels (e.g., sport-nationality instead of nationality-sport)
df_reordered = summaries.swaplevel()

# Sort the index for faster data access
df_sorted = summaries.sort_index()
Aggregations: Calculate statistics at different hierarchical levels:
# The level= parameter is crucial when grouping by index levels instead of columns
# Without level=, pandas would try to find a column named "sport" and raise a KeyError
# With level="sport", pandas knows to group by the "sport" level of the index
print(summaries.groupby(level="sport").mean())
Grouped operations are essential for subsetting data by categories (e.g., region, industry, time period) and calculating summary statistics (e.g., average GDP, total population). They form the backbone of many empirical economics workflows (think cross-section vs. panel-level computations, or region-by-region analysis).
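16.9 Selecting Rows
16.9.1 Integer Location: iloc
Use .iloc for integer-based indexing:
# Single row
row_5 = df.iloc[5]

# Multiple contiguous rows
first_ten = df.iloc[0:10]

# Non-contiguous row indices
some_rows = df.iloc[[2, 5, 10]]
16.9.2 Label-Based Selection: loc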
When a DataFrame has labeled indices (e.g., custom row labels or dates as an index), .loc is used:
# If rows are labeled by date, for example
row_date = df.loc["2020-01-01"]

# Slicing by label
subset_label = df.loc["2020-01-01":"2020-01-10"]
16.9.3 Boolean Indexing (Filtering)
Boolean indexing is a powerful feature in pandas that lets you filter data using logical conditions. It works by creating a boolean mask (an array of True/False values) that determines which rows to keep.
16.9.3.1 Basic Boolean Indexing
# Create a simple boolean mask
high_income = df["income"] > 50000
print(high_income)   # Shows True/False for each row
# 0     True
# 1    False
# 2     True
# ...

# Use the mask to filter the DataFrame
rich_people = df[high_income]
16.9.3.2 Multiple Conditions
You can combine multiple conditions using logical operators:
& (and)
| (or)
~ (not)
Important: wrap each condition in parentheses and use the operators & and |, not Python's and and or:
# Wrong:
# df[df["income"] > 50000 and
#    df["country"] == "USA"]

# Correct:
rich_americans = df[
    (df["income"] > 50000) &
    (df["country"] == "USA")
]

# Multiple complex conditions
target_group = df[
    (df["income"] > 50000) &     # High income
    (df["age"] < 40) &           # Young
    (
        (df["country"] == "USA") |
        (df["country"] == "Canada")
    )                            # North American
]
16.9.3.3 Common Use Cases
Filtering outliers:
# Remove extreme values
# (e.g., beyond 3 standard deviations)
mean = df["value"].mean()
std = df["value"].std()
clean_df = df[
    (df["value"] >= mean - 3*std) &
    (df["value"] <= mean + 3*std)
]
Time period isolation:
# Data from 2020 onwards
recent_data = df[df["date"] >= "2020-01-01"]
Category-based filtering:
# Large companies in specific sectors
large_tech = df[
    (df["market_cap"] > 1e9) &   # Over 1B
    (df["sector"].isin([
        "Technology",
        "Communications"
    ]))
]
Missing value handling:
# Find rows with complete data for specific columns
complete_cases = df[df[["income", "education", "age"]].notna().all(axis=1)]
Remember that boolean masks can be saved to variables for clarity and reuse, making your code more readable and maintainable.
16.10 Selecting Columns
Columns in a DataFrame can be treated much like keys in a dictionary:
# Single column
gdp_col = df["nationality"]

# Multiple columns
subset_cols = df[["nationality", "sport"]]

# Dot-notation works if the column name has no spaces or special characters
df.nationality
If your column names contain spaces or punctuation, you must stick with bracket notation (e.g., df["GDP (in billions)"]).
16.10.1 Advanced Column Selection
You can select columns based on their data types or using pattern matching:
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

# Select all string (object) columns
string_cols = df.select_dtypes(include=['object'])

# Select multiple types
categorical_or_string = df.select_dtypes(include=['category', 'object'])

# Exclude certain types
non_numeric = df.select_dtypes(exclude=['int64', 'float64'])

# Select columns whose names match a pattern
gdp_cols = df.filter(like='nat')      # Columns containing 'nat'
year_cols = df.filter(regex='^year')  # Columns starting with 'year' (regular expression)
16.10.2 Creating New Columns
You can create new columns from existing ones using arithmetic operations, functions, or more complex transformations:
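# Total medals
df["medals"] = df["gold"] + df["silver"] + df["bronze"]

# Using NumPy functions
df["log_height"] = np.log(df["height"])

# Percentage change from the previous row
# (relevant for a time-series DataFrame with, for example, a GDP column)
# df["GDP_growth"] = df["GDP"].pct_change() * 100

# Size relative to the sample mean
df["relative_size"] = (df["height"] / df["height"].mean()) * 100

# Conditional creation
df["won_something"] = np.where(
    df["medals"] > 0,
    "Yes", "No"
)
16.10.3 Operations on Column Groups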
You can apply operations to groups of columns using column selection methods. The .apply() function is particularly useful here: it allows you to run any function on each element (or column) of your data. Think of it as a more flexible version of a for-loop that works efficiently with pandas data structures.
# Demean all numeric columns
# (subtract mean from each value)
numeric_cols = df.select_dtypes(
    include=['int64', 'float64']
)
df[numeric_cols.columns] = (
    df[numeric_cols.columns]
    - df[numeric_cols.columns].mean()
)

# Scale all numeric columns to [0,1] range
def min_max_scale(x):
    return (
        (x - x.min()) /
        (x.max() - x.min())
    )

# apply() runs min_max_scale on each column
df[numeric_cols.columns] = (
    df[numeric_cols.columns].apply(
        min_max_scale
    )
)

# Log transform all positive numeric cols
positive_cols = numeric_cols.columns[
    df[numeric_cols.columns].gt(0).all()
]
df[positive_cols] = np.log(df[positive_cols])

# Standardize all numeric cols (z-score)
df[numeric_cols.columns] = (
    (df[numeric_cols.columns]
     - df[numeric_cols.columns].mean())
    / df[numeric_cols.columns].std()
)
The .apply() function is versatile:
It can work on rows or columns (specify axis=1 for rows; see the sketch after this list)
It accepts both built-in functions and custom functions you define
It’s often used for complex transformations that can’t be expressed in a single operation
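As a small illustration of row-wise application, the sketch below computes a body-mass index for each athlete; the bmi helper and column name are ours, and it assumes the original height (metres) and weight (kilograms) columns:
# Row-wise apply: combine several columns of the same row
def bmi(row):
    return row["weight"] / (row["height"] ** 2)

df["bmi"] = df.apply(bmi, axis=1)
For simple arithmetic like this, the vectorized form df["weight"] / df["height"]**2 is faster; axis=1 apply earns its keep when the per-row logic is genuinely complex.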
16.11 Dropping rows and columns
Sometimes you need to remove unwanted rows or columns from your DataFrame. pandas provides the drop() function for this purpose:
16.11.1 Dropping Columns
# Drop a single column
df_no_weight = df.drop('weight', axis=1)

# Drop multiple columns
df_cleaned = df.drop(['weight', 'dob'], axis=1)

# Drop columns by position
df_subset = df.drop(df.columns[2:4], axis=1)   # Drop 3rd and 4th columns

# Drop columns matching a pattern
cols_to_drop = df.filter(like='_old').columns  # Columns containing '_old'
df_new = df.drop(cols_to_drop, axis=1)
16.11.2 Dropping Rows
# Drop rows by index label
df_no_usa = df.drop('USA')          # If 'USA' is an index value
df_subset = df.drop(['USA', 'UK'])  # Drop multiple rows

# Drop rows by position
df_no_first = df.drop(0)     # Drop first row
df_subset = df.drop([0, 2])  # Drop first and third rows

# Drop duplicate rows
df_unique = df.drop_duplicates()
df_unique_subset = df.drop_duplicates(subset=['Country'])  # Consider only 'Country' column

# Drop rows based on a condition
df_with_medals = df.drop(df[df['medals'] == 0].index)  # Drop rows with no medals
Remember that drop operations return a new DataFrame by default. To modify the original DataFrame, use inplace=True:
# Modify DataFrame in place
df.drop('weight', axis=1, inplace=True)
16.12 Missing Values
Real-world datasets often have missing entries, represented by NaN (Not a Number). For economists, missing data might arise from incomplete surveys, unavailable country statistics for certain years, or partial reporting by firms.
16.12.1 Finding Missing Values
pandas provides several functions for detecting missing values:
isnull() and isna(): Return a boolean mask indicating which values are missing
notnull() and notna(): Return a boolean mask indicating which values are not missing
sum(): When applied to boolean values, counts the number of True values
These functions are essential for data quality assessment, as missing values can significantly impact statistical analyses and model performance. Here’s how to use them:
# Boolean mask of missing values in column_x
missing_mask = df["column_x"].isnull()   # Returns True for missing values
print(missing_mask)                      # Shows True/False for each row

# Count missing values in each column
missing_counts = df.isnull().sum()       # Counts missing values per column
print(missing_counts)                    # Shows count for each column

# Calculate percentage of missing values
missing_percentages = (df.isnull().sum() / len(df)) * 100
print(missing_percentages)               # Shows % missing for each column
The boolean masks returned by these functions can be used for filtering or as part of more complex data cleaning operations.
16.12.2 Dropping Missing Values
# Drop any rows with at least one missing value
df_dropped_any = df.dropna()

# Drop rows only if a certain column has missing values
df_dropped_subset = df.dropna(subset=["column_x"])
16.12.3 Filling Missing Values
Instead of dropping missing rows, you might choose to fill them:
# Different methods for filling missing values:
df_filled = df["column_x"].fillna(0)              # Replace NaN with a constant (zero)
df_ffill = df["column_x"].fillna(method="ffill")  # Forward fill: copy the last valid value forward
df_bfill = df["column_x"].fillna(method="bfill")  # Backward fill: copy the next valid value backward

# Example with time series data:
# Date    Value
# Jan-01  100
# Jan-02  NaN  -> ffill: 100, bfill: 150
# Jan-03  150
# Jan-04  NaN  -> ffill: 150, bfill: 200
# Jan-05  200

# Fill with computed statistics
df_mean = df["column_x"].fillna(df["column_x"].mean())      # Fill with column mean
df_median = df["column_x"].fillna(df["column_x"].median())  # Fill with column median

# You can also fill with group-specific means
# For example, fill missing GDP values with the mean GDP for that country:
df["GDP"] = df.groupby("country")["GDP"].transform(lambda x: x.fillna(x.mean()))
The appropriate method depends on your specific use case. Forward filling makes sense in time-series data when the last known value is a reasonable estimate for the missing one (like continuing a country’s GDP until new data is available). Backward filling might be preferred when future values are more relevant (like backfilling preliminary statistics with final numbers). Using means or medians is common when the missing values are believed to be random, while group-specific means can account for systematic differences between categories.
16.13 Merging and Joining DataFrames
When analyzing economic data, you often need to combine information from multiple sources. For example:
Computing GDP per capita requires:
A DataFrame with GDP values
Another DataFrame with population data
Analyzing international trade requires:
Trade flow data between countries
Exchange rate data for currency conversion
Country-level economic indicators
The pd.merge() function lets you combine DataFrames based on common columns, like matching country names or dates. Let’s see how this works with a simple example:
# First DataFrame: GDP data
gdp_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'Japan', 'Germany'],
    'GDP': [21.4, 2.7, 5.1, 4.2]
})
print("\nGDP Data:")
print(gdp_data)
#    Country   GDP
# 0      USA  21.4
# 1       UK   2.7
# 2    Japan   5.1
# 3  Germany   4.2

# Second DataFrame: Population data
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France', 'Japan'],
    'Population': [332, 67, 65, 126]
})
print("\nPopulation Data:")
print(population_data)
#   Country  Population
# 0     USA         332
# 1      UK          67
# 2  France          65
# 3   Japan         126

# Merge options demonstrate different ways to handle matches:

# 1. Inner join (default) - only keep countries present in BOTH DataFrames
inner_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='inner'
)
print("\nInner Join Result (only matching countries):")
print(inner_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2   Japan   5.1         126

# 2. Left join - keep ALL countries from gdp_data
left_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='left'
)
print("\nLeft Join Result (all GDP countries):")
print(left_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN

# 3. Right join - keep ALL countries from population_data
right_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='right'
)
print("\nRight Join Result (all Population countries):")
print(right_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2   Japan   5.1         126
# 3  France   NaN          65

# 4. Outer join - keep ALL countries from BOTH DataFrames
outer_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='outer'
)
print("\nOuter Join Result (all countries):")
print(outer_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN
# 4   France   NaN        65.0
The how parameter controls which rows to keep:
'inner': Only keep rows where the Country appears in both DataFrames (intersection)
'left': Keep all countries from the left (first) DataFrame
'right': Keep all countries from the right (second) DataFrame
'outer': Keep all countries from both DataFrames (union)
When a country exists in one DataFrame but not the other, missing values are filled with NaN.
16.13.1 Joining on Different Column Names
Sometimes the columns you want to join on have different names in each DataFrame. For example:
# GDP data uses 'Nation' while population data uses 'Country'
gdp_data = pd.DataFrame({
    'Nation': ['USA', 'UK', 'Japan'],
    'GDP': [21.4, 2.7, 5.1]
})
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France'],
    'Population': [332, 67, 65]
})

# Use left_on and right_on to specify different column names
merged_data = pd.merge(
    gdp_data,
    population_data,
    left_on='Nation',    # Column name in left (first) DataFrame
    right_on='Country',  # Column name in right (second) DataFrame
    how='outer'
)
print(merged_data)
#   Nation   GDP Country  Population
# 0    USA  21.4     USA       332.0
# 1     UK   2.7      UK        67.0
# 2  Japan   5.1     NaN         NaN
# 3    NaN   NaN  France        65.0

# The result contains both original columns
# You can drop the redundant one:
merged_data = merged_data.drop('Country', axis=1)
16.14 Reshaping Data with Pivot Tables
Economic data often comes in different formats depending on the source. Two common formats are:
Long format: Each row represents a single observation, with columns for identifiers (e.g., country, year) and values
Wide format: Each row represents an entity (e.g., country), with separate columns for each time period
Converting between these formats is a common task in economic analysis. Let’s see how to do this with pandas:
# Example: GDP by country and year in long format
gdp_long = pd.DataFrame({
    'Country': ['USA', 'USA', 'UK', 'UK'],
    'Year': [2020, 2021, 2020, 2021],
    'GDP': [20.9, 21.4, 2.6, 2.7]
})
print("\nLong Format:")
print(gdp_long)
#   Country  Year   GDP
# 0     USA  2020  20.9
# 1     USA  2021  21.4
# 2      UK  2020   2.6
# 3      UK  2021   2.7

# Convert to wide format using pivot
gdp_wide = gdp_long.pivot(
    index='Country',   # Rows
    columns='Year',    # Columns
    values='GDP'       # Values to fill
)
print("\nWide Format:")
print(gdp_wide)
# Year     2020  2021
# Country
# UK        2.6   2.7
# USA      20.9  21.4

# More complex pivot with multiple values and aggregations
# Add trade balance data
gdp_trade = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'UK', 'UK', 'UK'],
    'Year': [2020, 2020, 2021, 2020, 2020, 2021],
    'GDP': [20.9, 20.9, 21.4, 2.6, 2.6, 2.7],
    'Trade_Balance': [-0.5, -0.6, -0.7, 0.2, 0.3, 0.1]
})

# Pivot with aggregation functions
gdp_summary = pd.pivot_table(
    gdp_trade,
    index='Country',
    columns='Year',
    values=['GDP', 'Trade_Balance'],
    aggfunc={
        'GDP': 'mean',                    # Average if multiple values
        'Trade_Balance': ['mean', 'sum']  # Multiple aggregations
    }
)
print("\nComplex Pivot Result:")
print(gdp_summary)
#            GDP        Trade_Balance
# Year      2020  2021           mean   sum
# Country
# UK         2.6   2.7           0.25   0.6
# USA       20.9  21.4          -0.55  -1.1
The pivot operations are particularly useful when:
Preparing data for time series analysis (wide format often preferred)
Creating summary tables for research papers
Computing year-over-year changes
Comparing multiple metrics across countries or time periods
16.15 Sorting
Sorting data is a fundamental operation in data analysis. pandas provides flexible methods to sort DataFrames by one or multiple columns, or by index values. This is particularly useful when working with economic time series or when preparing data for presentation.
16.15.1 Basic Sorting
# Sort by a single column
df_sorted = df.sort_values("GDP", ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(["Country", "Year"])

# Sort by index
df_sorted = df.sort_index()
16.15.2 Advanced Sorting Options
You can customize sorting behavior with additional parameters:
# Sort by multiple columns in different directions
df_sorted = df.sort_values(
    ["Country", "GDP"],
    ascending=[True, False]   # Ascending by Country, descending by GDP
)

# Handle missing values (NaN)
df_sorted = df.sort_values(
    "GDP",
    na_position="first"   # Put NaN values at the beginning
)
Remember that sorting operations return a new DataFrame by default. To modify the original DataFrame in place, use the inplace=True parameter.
16.16 Saving DataFrames
After manipulating your data, you’ll often want to save the results for future use or sharing with colleagues. pandas provides several methods to export DataFrames to various file formats, each suited to different needs.
16.16.1 Basic File Formats
For small to medium datasets, standard formats work well:
# Save to CSV (most common)
df.to_csv("output.csv")

# Save to Excel
df.to_excel("output.xlsx", sheet_name="Sheet1")

# Save to JSON
df.to_json("output.json")
16.16.2 Advanced Export Options
You can customize the export process with various parameters:
# CSV with specific encoding and date format
df.to_csv(
    "output.csv",
    encoding="utf-8",
    date_format="%Y-%m-%d",
    index=False   # Don't save index as a column
)

# Excel with multiple sheets
with pd.ExcelWriter("output.xlsx") as writer:
    df1.to_excel(
        writer,
        sheet_name="GDP Data"
    )
    df2.to_excel(
        writer,
        sheet_name="Population Data"
    )

# Compressed CSV for large datasets
df.to_csv("output.csv.gz", compression="gzip")   # gzip compression
df.to_csv("output.csv.bz2", compression="bz2")   # bzip2 compression
df.to_csv("output.csv.zip", compression="zip")   # zip compression

# Parquet format - excellent for large datasets
df.to_parquet(
    "output.parquet",
    compression="snappy"   # Fast compression, good for analytics
)

# Parquet with partitioning (e.g., by year and country)
df.to_parquet(
    "data_directory",
    partition_cols=["year", "country"]   # Creates nested directories
)
Remember to consider file size, readability, and compatibility when choosing an export format. CSV files are widely compatible but may lose data type information, while Excel files preserve formatting but may have size limitations.
16.17 Further Resources
For more advanced features and detailed documentation, see the official pandas documentation at https://pandas.pydata.org/docs/. It provides comprehensive coverage of pandas’ capabilities, including advanced features not covered in this introduction.
16.18 Concluding Remarks
pandas is a cornerstone for data handling tasks in Python, especially when working with mixed-type, row-and-column datasets frequently encountered in economics. By combining the power of pandas with NumPy, you can:
Load and Inspect
Rapidly read data from files or databases and get immediate insights into your dataset’s shape, column names, and data types.
Filter and Subset
Flexibly select rows and columns using integer-based, label-based, or Boolean indexing—making it easy to isolate subsets for detailed analysis or cleaning.
Handle Missing Data
Identify, drop, or fill missing values in a systematic way, crucial for preparing data for econometric models or machine learning pipelines.
Aggregate
Compute descriptive statistics across the entire dataset or by groups. In economics, this is especially valuable for multi-country, multi-industry, or multi-year analyses.
Categorical and Continuous Data
Distinguish between data types for appropriate transformations and quick statistical summaries.
With these skills, you can confidently tackle most routine data management and exploratory tasks in Python. As your projects scale up or require more sophisticated operations, pandas provides additional features—merging, pivoting, time-series functionality, and more—that integrate seamlessly with the broader Python ecosystem for data analysis, econometrics, and machine learning.
---title: "Pandas: Working with Dataframes"---## IntroductionIn economics and data science, we often work with **tabular** data, whether they come from:- Surveys and administrative records- Macroeconomic databases (e.g., World Bank, IMF)- Financial market data feeds- Census and demographic data- Firm-level accounting statementsWhile NumPy arrays are highly efficient for numerical operations, they lack some of the conveniences required to handle real-world economic data:- Row and column labels (e.g., country names, dates)- Mixed data types (e.g., numeric GDP values with categorical country codes)- Missing values (e.g., unreported statistics for some countries/years)- Hierarchical relationships (e.g., firms nested within industries within countries)This is where **pandas** excels. Built on top of NumPy, pandas provides flexible, Pythonic data structures—namely **Series** (1D) and **DataFrames** (2D)—that support common operations like data cleaning, subsetting, merging, and reshaping.The DataFrame in pandas is the closest analog to what economists and social scientists might see in Stata, R, or Excel: rows representing observations, columns representing variables, and the ability to label these axes in intuitive ways. This chapter will guide you through:1. **Data Structures**: Understanding Series and DataFrames2. **Data Loading**: Reading from files and databases3. **Data Cleaning**: Handling missing values and data types4. **Data Manipulation**: Filtering, sorting, and transforming5. **Data Analysis**: Computing statistics and aggregations6. **Data Export**: Saving results in various formatsThese skills are crucial for real-world applications in economics and machine learning, where data rarely comes in a perfectly clean and ready-to-use format.## Pandas Series: The Building Block of DataFramesBefore diving deeper into DataFrames, it's crucial to understand **Series**, the fundamental building block of pandas. A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.). In economic data analysis, individual Series often represent:- Time series of a single variable (e.g., monthly GDP values)- Cross-sectional observations (e.g., population across countries)- Categories or classifications (e.g., industry sectors)Think of a Series as analogous to a single column in Stata or a vector in R, but with added capabilities for handling missing data and maintaining row labels (index). 
This labeling is particularly valuable when working with economic data where observations often have meaningful identifiers like country names, dates, or industry codes.### Relationship with NumPySince pandas is built on top of NumPy, there's a close relationship between pandas Series/DataFrames and NumPy arrays:```pythonimport numpy as npimport pandas as pd# Convert NumPy array to Seriesarr = np.array([1, 2, 3])s = pd.Series(arr)# Convert Series to NumPy arrayarr_back = s.to_numpy()# Convert 2D NumPy array to DataFramearr_2d = np.array([ [1, 2, 3], [4, 5, 6]])df = pd.DataFrame( arr_2d, columns=['A', 'B', 'C'])# Convert DataFrame to 2D NumPy arrayarr_2d_back = df.to_numpy()# NumPy functions work on Series/DataFrames# Element-wise square root of Seriesprint(np.sqrt(s)) # Mean of each DataFrame columnprint(np.mean(df)) # Exponential of one columnprint(np.exp(df['A'])) # Universal functions (ufuncs) work too# Element-wise exponentialprint(np.exp(s)) ```While pandas builds on NumPy's efficient array operations, there are several key differences that make pandas Series more flexible for data analysis. First, Series have an index that labels each position in the array, allowing for intuitive data access using meaningful identifiers rather than just integer positions. This is particularly useful when working with time series data or when rows represent distinct entities like countries or companies.Additionally, Series preserve pandas-specific features that aren't available in NumPy arrays, such as sophisticated missing value handling.### Creating SeriesYou can create a Series from a list, NumPy array, or dictionary:```pythonimport pandas as pdimport numpy as np# From a lists1 = pd.Series( [10, 20, 30, 40])print(s1)# 0 10# 1 20# 2 30# 3 40# dtype: int64# From a NumPy arrays2 = pd.Series( np.array([1, 3, 5, 7]))print(s2)# 0 1# 1 3# 2 5# 3 7# dtype: int64# From a dictionary (keys become index)s3 = pd.Series({'a': 100, 'b': 200, 'c': 300})print(s3)# a 100# b 200# c 300# dtype: int64```## Pandas DataFramesA DataFrame is essentially a collection of Series objects that share the same index. Each column in a DataFrame is a Series. This relationship is crucial for understanding how to extract and manipulate data within a DataFrame.### Creating a DataFrame by combining SeriesYou can create a DataFrame by combining multiple Series objects. 
The Series must share the same index for proper alignment:```python# Create individual Seriesgdp = pd.Series( [19.4, 3.0, 4.9], index=['USA', 'UK', 'Japan'])pop = pd.Series( [332, 67, 126], index=['USA', 'UK', 'Japan'])# Combine into DataFramedf = pd.DataFrame({'GDP_trillion': gdp,'Population_million': pop})print(df)# GDP_trillion Population_million# USA 19.4 332.0# UK 3.0 67.0# Japan 4.9 126.0```The resulting DataFrame uses the Series' shared index as row labels and the dictionary keys as column names.Later we will see how to import data from files into DataFrames.### Extracting a Series from a DataFrameYou can extract a column (which is a Series) from a DataFrame using bracket notation:```python# Assuming df is a DataFramegdp_series = df["GDP"]print(type(gdp_series))# <class 'pandas.core.series.Series'>```### Adding a Series to a DataFrameYou can add a Series as a new column to a DataFrame:```python# Create a new Seriesnew_series = pd.Series( [100, 200, 300], index=['USA', 'UK', 'Japan'])# Add it as a new column df["New_Column"] = new_seriesprint(df)# GDP New_Column# USA 100 100# UK 200 200# Japan 300 300```## Understanding Row Identification in pandasBefore diving into specific operations, it's important to understand how pandas identifies and organizes rows in a DataFrame. This conceptual foundation is crucial for effectively working with data, especially when performing operations like filtering, grouping, or panel data analysis.### Index Types1. **Default Index** - When you create a DataFrame without specifying an index, pandas assigns a default integer index (0, 1, 2, ...).```python df = pd.DataFrame({'GDP': [100, 200, 300]})print(df)# GDP# 0 100# 1 200# 2 300```2. **Label Index** - You can assign meaningful labels to rows using any hashable type (strings, integers, dates). - Common in economics for time series or cross-sectional data.```python df.index = ['USA', 'UK', 'Japan']print(df)# GDP# USA 100# UK 200# Japan 300```3. **Multi-Level (Hierarchical) Index** - Rows can be identified by multiple levels of labels. - Essential for panel data and nested categorizations.```python# Create DataFrame with multi-level index df = pd.DataFrame({'GDP': [100, 120, 200, 220],'Year': [2020, 2021, 2020, 2021],'Country': ['USA', 'USA', 'UK', 'UK'] })# Set multi-level index df_panel = df.set_index( ['Country', 'Year'] )print(df_panel)# GDP# Country Year # USA 2020 100# 2021 120# UK 2020 200# 2021 220```### Working with Indices1. **Setting and Resetting**```python# Set a single column as index df = df.set_index('date')# Set multiple columns as index levels df = df.set_index(['country', 'year'])# Convert index back to regular columns df = df.reset_index()```2. **Index Operations**```python# Check index propertiesprint(df.index.names) # Names of index levelsprint(df.index.levels) # Values in each level# Sort by index df = df.sort_index()```3. **Accessing Data** - Different index types enable various data access patterns:```python# Single-level index gdp_usa = df.loc['USA']# Multi-level index gdp_usa_2020 = df.loc[('USA', 2020)]```Understanding these index concepts is fundamental to effective data analysis in pandas. The index system determines how you select and filter data, enables powerful grouping and aggregation operations, and provides natural representation for hierarchical data structures. 
This flexibility is particularly important when working with time series, where dates serve as indices, or when dealing with panel data that requires multiple levels of identification.The hierarchical indexing capabilities of pandas make it especially powerful for handling complex data structures. Whether you're analyzing panel data with country-year observations, studying nested relationships across industries and regions over time, or conducting cross-sectional studies with multiple categorical dimensions, the index system provides the foundation for efficient data organization and analysis.## Loading and Inspecting DatasetsReal-world data often exist in CSV files, Excel spreadsheets, or databases. pandas provides convenient functions to load data from these sources. For example, we load a dataset in CSV format which contain details of atheletes who participate in a particular year's Olympic games.```{python}#| eval: trueimport pandas as pd# Reading a CSV filedf = pd.read_csv("../data/athletes.csv")# Peek at the first few rowsprint(df.head())# Get DataFrame shape (rows, columns)print(df.shape)# Check column namesprint(df.columns)```- **`pd.read_csv()`**: Reads comma-separated values (CSV). You can customize delimiters (`sep='\t'` for TSV files), specify headers, handle missing values, and more via additional parameters.- **`df.head()`**: Shows the first 5 rows (or another number you specify).- **`df.shape`**: Returns a tuple $(\text{rows},\text{columns})$.- **`df.columns`**: Lists the column names, which is helpful to quickly see the variables available in the dataset.You can also load Excel files with `pd.read_excel("file.xlsx")`, JSON data with `pd.read_json("file.json")`, and more.If you'd like to work along with the examples in this chapter, you can download it from <https://mlbook.jyotirmoy.net/static/data/athletes.csv>. Adjust the path in the `read_csv` call to the path where you saved the file.### Examining Continuous ColumnsMany datasets have both continuous (height, weight) and categorical (e.g., sport, nationality) columns. You can quickly **describe** numeric and non-numeric data:```{python}#| eval: true# Summary of numeric columnsprint(df.describe())```The output shows summary statistics for numeric columns:- count: Number of non-null (non-missing) values for each column- mean: Average value- std: Standard deviation- min: Minimum value- 25%: First quartile (25th percentile)- 50%: Median (50th percentile)- 75%: Third quartile (75th percentile)- max: Maximum value### Examining categorical columns```{python}#|eval: trueprint(df.describe(include=["object", "category"]))```For **categorical** data, `.describe()` shows count, unique category count, top category, and frequency of the top category.## Data Types (`dtypes`)Each column in a DataFrame has an associated data type, which can significantly impact memory usage and performance. Common dtypes include **float**, **int**, **bool**, **datetime**, and **object** (for strings or mixed content).```{python}#| eval: trueprint(df.dtypes)```This reveals how pandas interprets each column. If a particular column (like a categorical variable or date) is not recognized correctly, you can change its type:When reading CSV files, string columns are automatically stored as 'object' dtype, which is pandas' way of handling Python strings and mixed data types. 
While flexible, object dtypes consume more memory because each value is stored as a separate Python object with its own overhead.For columns with a limited set of unique values (like 'sex', 'nationality'), converting to 'category' dtype can significantly reduce memory usage and improve performance:```{python}#| eval: true# Convert string columns to categoricaldf['sex'] = df['sex'].astype('category')df['nationality'] = df['nationality'].astype('category')df['sport'] = df['sport'].astype('category')```The `dob` column was not correctly recognized that as a date. Lets convert that too:```{python}#| eval: true# Convert date columns from object to datetime# For example, if 'dob' column contains dates like '01/15/95':df['dob'] = pd.to_datetime( df['dob'], format='%m/%d/%y')# Now df['dob'] is datetime64[ns] type instead of object```## Cross-tabulationCross-tabulation (or contingency tables) is a method to analyze the relationship between categorical variables by displaying their frequency distributions in a table format. In economics, this is particularly useful for:- Analyzing employment status across education levels- Studying industry distribution across regions- Examining trade relationships between countries- Understanding market segment distributionspandas provides the `pd.crosstab()` function for creating these tables:```{python}#| eval: true# Basic cross-tabulationpd.crosstab( df['nationality'], # Row variable df['sex'] # Column variable)``````{python}#| eval: true# With margins (row/column totals)pd.crosstab( df['nationality'], df['sex'], margins=True# Add row/column totals)``````{python}#| eval: true# Normalize by rows (proportions)pd.crosstab( df['nationality'], df['sex'], normalize='index'# Row percentages)``````{python}# Normalize by columns (proportions)pd.crosstab( df['nationality'], df['sex'], normalize='columns'# Column percentages)``````{python}#| eval: true# Multiple row variables by # providing a list of columns# as the rowspd.crosstab( [df['nationality'],df['sport']], df['sex'] )```We can similarly have multiple columns to create a hierarchical table.The `aggfunc` parameter in crosstab determines how to combine multiple values that fall into the same category combinations. For example, when analyzing company data by Industry and Region:- `sum`: Adds up all values (e.g., total revenue for each Industry-Region pair)- `mean`: Calculates average (e.g., average company size per Industry-Region)- `count`: Counts occurrences (e.g., number of companies in each Industry-Region)- `min/max`: Finds extreme values (e.g., largest company in each Industry-Region)Custom functions can also be used for more complex aggregations.Suppose for each country we want the average height men and women.```{python}#| eval: true# With aggregation functionpd.crosstab( df['nationality'], df['sex'], values=df['height'], # Values to aggregate aggfunc='mean'# Aggregation function)```The `aggfunc` parameter determines how values are combined when multiple rows map to the same Industry-Region combination. For example, if you have multiple companies in the same industry and region, `aggfunc='sum'` will add their revenues together. Other common options include:- 'mean': Average value (e.g., average revenue per industry-region)- 'count': Number of occurrences- 'min'/'max': Extreme values- Custom functions can also be used for more complex aggregations## Computing Aggregates and Summary StatisticsSummaries and group-level statistics are integral to data analysis in economics. 
With pandas, you can compute aggregates across the entire dataset or grouped by certain columns. These operations are essential for understanding data distributions, identifying patterns, and preparing summary tables for research papers.### Simple Aggregations```{python}#| eval: true# Basic statisticsprint("Mean:",df["height"].mean()) print("Median:", df["height"].median()) print("Standard deviation:", df["height"].std())print("Variance:",df["height"].var())# Range statisticsprint("Min:",df["height"].min()) print("Max:",df["height"].max())print("First quartile:",df["height"].quantile(0.25)) print() print("Multiple quartiles:")print(df["height"].quantile([0.25, 0.5, 0.75]))print()# Countsprint("No. of non-missing values: ",df["nationality"].count())print("No. of unique values; ",df["nationality"].nunique()) ```You can apply these directly to columns or the entire DataFrame (if all columns are numeric).### Grouped AggregationsThe `.groupby()` function is pandas' primary tool for grouped aggregations. It lets you:1. Split the data into groups2. Apply operations to each group independently3. Combine the results into a new DataFrameWhen you perform a groupby operation, pandas automatically creates an index from your grouping columns, following the index concepts introduced earlier:```{python}#| eval: true# Group by country and compute mean height# Result is a new DataFrameheight_by_nat = df.groupby("nationality")["height"].mean()print(height_by_nat)``````{python}#| eval: true# Multiple aggregations for one column# Result has country as index, aggregation functions as columnsheight_by_nat = df.groupby("nationality")["height"].agg(["mean", "std"])print(height_by_nat)```You can group by multiple columns to create a hierarchical structure:```{python}#| eval: true# Group by sector and region, # aggregating multiple columnssummaries = df.groupby( ["nationality", "sport"]).agg({"height": ["mean", "std"],"weight": ["mean", "std"]})print(summaries)```When grouping by multiple columns like this, pandas creates a **hierarchical index** (also called a multi-level index). This structure is particularly valuable in economics because it allows you to analyze data at different levels of granularity. ### Working with Multi-Level IndicesMulti-level indices are powerful tools for analyzing hierarchically structured data. They're particularly valuable in economics for analyzing nested relationships like country-year observations or sector-region combinations. Here's how to work with them effectively:1. **Accessing Data**: There are several ways to select data from a multi-level index:```{python}#| eval: True# Get all data for one country (returns all regions for USA)print(summaries.loc["USA"])``````{python}#| eval: True# Get data for a specific nation-sport pair using a tupleprint(summaries.loc[("USA", "tennis")])``````{python}#| eval: true# Cross-section: get all football players# .xs() is especially useful for selecting from inner levelsprint(summaries.xs("football", level="sport"))```2. **Index Operations**: You can reshape your data structure:```python# Convert hierarchical index into regular columns# Useful when you need a flat table formatdf_flat = summaries.reset_index()# Change the order of index levels (e.g., sport-nationality instead of nationality-sport)df_reordered = summaries.swaplevel()# Sort the index for faster data accessdf_sorted = summaries.sort_index()```3. 
3. **Aggregations**: Calculate statistics at different hierarchical levels:

```{python}
#| eval: true
# The level= parameter is crucial when grouping by index levels instead of columns.
# Without level=, pandas would try to find a column named "sport" and raise a KeyError.
# With level="sport", pandas groups by the "sport" level of the index.
print(summaries.groupby(level="sport").mean())
```

Grouped operations are essential for subsetting data by categories (e.g., region, industry, time period) and calculating summary statistics (e.g., average GDP, total population). They form the backbone of many empirical economics workflows (think cross-section vs. panel-level computations, or region-by-region analysis).

## Selecting Rows

### Integer Location: `iloc`

Use `.iloc` for **integer-based** indexing:

```python
# Single row
row_5 = df.iloc[5]

# Multiple contiguous rows
first_ten = df.iloc[0:10]

# Non-contiguous row indices
some_rows = df.iloc[[2, 5, 10]]
```

### Label-Based Selection: `loc`

When a DataFrame has labeled indices (e.g., custom row labels or dates as an index), `.loc` is used:

```python
# If rows are labeled by date, for example
row_date = df.loc["2020-01-01"]

# Slicing by label
subset_label = df.loc["2020-01-01":"2020-01-10"]
```

### Boolean Indexing (Filtering)

Boolean indexing is a powerful feature in pandas that lets you filter data using logical conditions. It works by creating a boolean mask (an array of True/False values) that determines which rows to keep.

#### Basic Boolean Indexing

```python
# Create a simple boolean mask
high_income = df["income"] > 50000
print(high_income)  # Shows True/False for each row
# 0     True
# 1    False
# 2     True
# ...

# Use the mask to filter the DataFrame
rich_people = df[high_income]
```

#### Multiple Conditions

You can combine multiple conditions using logical operators:

- `&` (and)
- `|` (or)
- `~` (not)

Important: wrap each condition in parentheses and use the operators `&` and `|`, not Python's `and` and `or` keywords (a short sketch of negation with `~` follows this example):

```python
# Wrong:
# df[df["income"] > 50000 and
#    df["country"] == "USA"]

# Correct:
rich_americans = df[
    (df["income"] > 50000) &
    (df["country"] == "USA")
]

# Multiple complex conditions
target_group = df[
    (df["income"] > 50000) &  # High income
    (df["age"] < 40) &        # Young
    (
        (df["country"] == "USA") |
        (df["country"] == "Canada")
    )                         # North American
]
```
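The `~` operator inverts a mask. A minimal sketch, reusing the hypothetical `income` and `country` columns from the examples above:

```python
# Negation with ~: everyone who is NOT in the USA
non_usa = df[~(df["country"] == "USA")]

# Saved masks can be inverted the same way
is_american = df["country"] == "USA"
non_usa = df[~is_american]
```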
#### Common Use Cases

1. Filtering outliers:

```python
# Remove extreme values
# (e.g., beyond 3 standard deviations)
mean = df["value"].mean()
std = df["value"].std()
clean_df = df[
    (df["value"] >= mean - 3 * std) &
    (df["value"] <= mean + 3 * std)
]
```

2. Time period isolation:

```python
# Data from 2020 onwards
recent_data = df[df["date"] >= "2020-01-01"]
```

3. Category-based filtering:

```python
# Large companies in specific sectors
large_tech = df[
    (df["market_cap"] > 1e9) &  # Over 1B
    (df["sector"].isin([
        "Technology",
        "Communications"
    ]))
]
```

4. Missing value handling:

```python
# Find rows with complete data for specific columns
complete_cases = df[
    df[["income", "education", "age"]].notna().all(axis=1)
]
```

Remember that boolean masks can be saved to variables for clarity and reuse, making your code more readable and maintainable.

## Selecting Columns

Columns in a DataFrame can be treated much like keys in a dictionary:

```python
# Single column
nat_col = df["nationality"]

# Multiple columns
subset_cols = df[["nationality", "sport"]]

# Dot notation works if the column name has no spaces or special characters
df.nationality
```

If your column names contain spaces or punctuation, you must stick with bracket notation (e.g., `df["GDP (in billions)"]`).

### Advanced Column Selection

You can select columns based on their data types or using pattern matching:

```python
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

# Select all string (object) columns
string_cols = df.select_dtypes(include=['object'])

# Select multiple types
categorical_or_string = df.select_dtypes(include=['category', 'object'])

# Exclude certain types
non_numeric = df.select_dtypes(exclude=['int64', 'float64'])

# Select columns whose names match a pattern
nat_cols = df.filter(like='nat')      # Columns containing 'nat'
year_cols = df.filter(regex='^year')  # Columns starting with 'year' (regular expression)
```

### Creating New Columns

You can create new columns from existing ones using arithmetic operations, functions, or more complex transformations:

```python
# Total medals
df["medals"] = df["gold"] + df["silver"] + df["bronze"]

# Using NumPy functions
df["log_height"] = np.log(df["height"])

# Percentage change from the previous period
df["GDP_growth"] = df["GDP"].pct_change() * 100

# Multiple operations
df["relative_size"] = (df["height"] / df["height"].mean()) * 100

# Conditional creation
df["won_something"] = np.where(
    df["medals"] > 0,
    "Yes", "No"
)
```

### Operations on Column Groups

You can apply operations to groups of columns using column selection methods. The `.apply()` function is particularly useful here - it allows you to run any function on each column (or row) of your data. Think of it as a more flexible version of a for-loop that works efficiently with pandas data structures.

```python
# Demean all numeric columns
# (subtract the column mean from each value)
numeric_cols = df.select_dtypes(
    include=['int64', 'float64']
)
df[numeric_cols.columns] = (
    df[numeric_cols.columns]
    - df[numeric_cols.columns].mean()
)

# Scale all numeric columns to the [0, 1] range
def min_max_scale(x):
    return (
        (x - x.min())
        / (x.max() - x.min())
    )

# apply() runs min_max_scale on each column
df[numeric_cols.columns] = (
    df[numeric_cols.columns].apply(
        min_max_scale
    )
)

# Log-transform all positive numeric columns
positive_cols = numeric_cols.columns[
    df[numeric_cols.columns].gt(0).all()
]
df[positive_cols] = np.log(df[positive_cols])

# Standardize all numeric columns (z-score)
df[numeric_cols.columns] = (
    (df[numeric_cols.columns]
     - df[numeric_cols.columns].mean())
    / df[numeric_cols.columns].std()
)
```

The `.apply()` function is versatile:

- It can work on rows or columns (specify `axis=1` for rows), as sketched below
- It accepts both built-in functions and custom functions you define
- It's often used for complex transformations that can't be expressed in a single operation
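A minimal sketch of a row-wise apply, assuming `height` is recorded in centimetres and `weight` in kilograms (adjust the formula if your data use other units):

```python
# Row-wise apply: compute a body mass index for each athlete
def bmi(row):
    height_m = row["height"] / 100  # assumes centimetres
    return row["weight"] / height_m ** 2

df["bmi"] = df.apply(bmi, axis=1)
```

Row-wise `apply` calls a Python function once per row, so for large datasets a vectorized expression such as `df["weight"] / (df["height"] / 100) ** 2` is usually much faster.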
## Dropping rows and columns

Sometimes you need to remove unwanted rows or columns from your DataFrame. pandas provides the `drop()` method for this purpose:

### Dropping Columns

```python
# Drop a single column
df_no_weight = df.drop('weight', axis=1)

# Drop multiple columns
df_cleaned = df.drop(['weight', 'dob'], axis=1)

# Drop columns by position
df_subset = df.drop(df.columns[2:4], axis=1)  # Drop 3rd and 4th columns

# Drop columns matching a pattern
cols_to_drop = df.filter(like='_old').columns  # Columns containing '_old'
df_new = df.drop(cols_to_drop, axis=1)
```

### Dropping Rows

```python
# Drop rows by index label
df_no_usa = df.drop('USA')          # If 'USA' is an index value
df_subset = df.drop(['USA', 'UK'])  # Drop multiple rows

# Drop rows by position
df_no_first = df.drop(0)     # Drop first row
df_subset = df.drop([0, 2])  # Drop first and third rows

# Drop duplicate rows
df_unique = df.drop_duplicates()
df_unique_subset = df.drop_duplicates(subset=['nationality'])  # Consider only the 'nationality' column

# Drop rows based on a condition
df_medalists = df.drop(df[df['medals'] == 0].index)  # Drop rows with zero medals
```

Remember that drop operations return a new DataFrame by default. To modify the original DataFrame, use `inplace=True`:

```python
# Modify DataFrame in place
df.drop('weight', axis=1, inplace=True)
```
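As an alternative to `axis=1`, `drop()` also accepts the `columns=` and `index=` keyword arguments, which many find more readable; a brief sketch:

```python
# Equivalent keyword-based forms
df_no_weight = df.drop(columns=['weight'])  # drop a column
df_no_first = df.drop(index=[0])            # drop a row by label
```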
## Missing Values

Real-world datasets often have missing entries, represented by NaN (Not a Number). For economists, missing data might arise from incomplete surveys, unavailable country statistics for certain years, or partial reporting by firms.

### Finding Missing Values

pandas provides several functions for detecting missing values:

- `isnull()` and `isna()`: Return a boolean mask indicating which values are missing
- `notnull()` and `notna()`: Return a boolean mask indicating which values are not missing
- `sum()`: When applied to boolean values, counts the number of True values

These functions are essential for data quality assessment, as missing values can significantly impact statistical analyses and model performance. Here's how to use them:

```python
# Boolean mask of missing values in column_x
missing_mask = df["column_x"].isnull()  # True where values are missing
print(missing_mask)  # Shows True/False for each row

# Count missing values in each column
missing_counts = df.isnull().sum()  # Counts missing values per column
print(missing_counts)  # Shows the count for each column

# Calculate the percentage of missing values
missing_percentages = (df.isnull().sum() / len(df)) * 100
print(missing_percentages)  # Shows % missing for each column
```

The boolean masks returned by these functions can be used for filtering or as part of more complex data cleaning operations.

### Dropping Missing Values

```python
# Drop any rows with at least one missing value
df_dropped_any = df.dropna()

# Drop rows only if a certain column has missing values
df_dropped_subset = df.dropna(subset=["column_x"])
```

### Filling Missing Values

Instead of dropping missing rows, you might choose to **fill** them:

```python
# Different methods for filling missing values:
df_filled = df["column_x"].fillna(0)  # Replace NaN with a constant (zero)
df_ffill = df["column_x"].ffill()     # Forward fill: copy the last valid value forward
df_bfill = df["column_x"].bfill()     # Backward fill: copy the next valid value backward

# Example with time series data:
# Date    Value
# Jan-01  100
# Jan-02  NaN   -> ffill: 100, bfill: 150
# Jan-03  150
# Jan-04  NaN   -> ffill: 150, bfill: 200
# Jan-05  200

# Fill with computed statistics
df_mean = df["column_x"].fillna(df["column_x"].mean())      # Fill with the column mean
df_median = df["column_x"].fillna(df["column_x"].median())  # Fill with the column median

# You can also fill with group-specific means.
# For example, fill missing GDP values with the mean GDP for that country:
df["GDP"] = df.groupby("country")["GDP"].transform(lambda x: x.fillna(x.mean()))
```

The appropriate method depends on your specific use case. Forward filling makes sense in time-series data when the last known value is a reasonable estimate for the missing one (like carrying a country's GDP forward until new data become available). Backward filling might be preferred when future values are more relevant (like backfilling preliminary statistics with final numbers). Using means or medians is common when the missing values are believed to be random, while group-specific means can account for systematic differences between categories.

## Merging and Joining DataFrames

When analyzing economic data, you often need to combine information from multiple sources. For example:

1. Computing GDP per capita requires:
   - A DataFrame with GDP values
   - Another DataFrame with population data
2. Analyzing international trade requires:
   - Trade flow data between countries
   - Exchange rate data for currency conversion
   - Country-level economic indicators

The `pd.merge()` function lets you combine DataFrames based on common columns, like matching country names or dates. Let's see how this works with a simple example:
```python
# First DataFrame: GDP data
gdp_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'Japan', 'Germany'],
    'GDP': [21.4, 2.7, 5.1, 4.2]
})
print("\nGDP Data:")
print(gdp_data)
#    Country   GDP
# 0      USA  21.4
# 1       UK   2.7
# 2    Japan   5.1
# 3  Germany   4.2

# Second DataFrame: Population data
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France', 'Japan'],
    'Population': [332, 67, 65, 126]
})
print("\nPopulation Data:")
print(population_data)
#   Country  Population
# 0     USA         332
# 1      UK          67
# 2  France          65
# 3   Japan         126

# Merge options demonstrate different ways to handle matches:

# 1. Inner join (default) - only keep countries present in BOTH DataFrames
inner_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='inner'
)
print("\nInner Join Result (only matching countries):")
print(inner_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2   Japan   5.1         126

# 2. Left join - keep ALL countries from gdp_data
left_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='left'
)
print("\nLeft Join Result (all GDP countries):")
print(left_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN

# 3. Right join - keep ALL countries from population_data
right_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='right'
)
print("\nRight Join Result (all Population countries):")
print(right_joined)
#   Country   GDP  Population
# 0     USA  21.4         332
# 1      UK   2.7          67
# 2  France   NaN          65
# 3   Japan   5.1         126

# 4. Outer join - keep ALL countries from BOTH DataFrames
outer_joined = pd.merge(
    gdp_data,
    population_data,
    on='Country',
    how='outer'
)
print("\nOuter Join Result (all countries):")
print(outer_joined)
#    Country   GDP  Population
# 0      USA  21.4       332.0
# 1       UK   2.7        67.0
# 2    Japan   5.1       126.0
# 3  Germany   4.2         NaN
# 4   France   NaN        65.0
```

The `how` parameter controls which rows to keep:

- `'inner'`: Only keep rows where the Country appears in both DataFrames (intersection)
- `'left'`: Keep all countries from the left (first) DataFrame
- `'right'`: Keep all countries from the right (second) DataFrame
- `'outer'`: Keep all countries from both DataFrames (union)

When a country exists in one DataFrame but not the other, missing values are filled with `NaN`.

### Joining on Different Column Names

Sometimes the columns you want to join on have different names in each DataFrame. For example:

```python
# GDP data uses 'Nation' while population data uses 'Country'
gdp_data = pd.DataFrame({
    'Nation': ['USA', 'UK', 'Japan'],
    'GDP': [21.4, 2.7, 5.1]
})
population_data = pd.DataFrame({
    'Country': ['USA', 'UK', 'France'],
    'Population': [332, 67, 65]
})

# Use left_on and right_on to specify different column names
merged_data = pd.merge(
    gdp_data,
    population_data,
    left_on='Nation',    # Column name in left (first) DataFrame
    right_on='Country',  # Column name in right (second) DataFrame
    how='outer'
)
print(merged_data)
#   Nation   GDP Country  Population
# 0    USA  21.4     USA       332.0
# 1     UK   2.7      UK        67.0
# 2  Japan   5.1     NaN         NaN
# 3    NaN   NaN  France        65.0

# The result contains both original columns.
# You can drop the redundant one:
merged_data = merged_data.drop('Country', axis=1)
```
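Returning to the motivating example at the start of this section, a minimal sketch of computing GDP per capita from the `inner_joined` table above, assuming GDP is measured in trillions of USD and population in millions:

```python
# Trillions / millions = millions of USD per person;
# multiply by 1000 to express the result in thousands of USD
gdp_pc = inner_joined.copy()
gdp_pc["GDP_per_capita"] = gdp_pc["GDP"] / gdp_pc["Population"] * 1000
print(gdp_pc)
```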
## Reshaping Data with Pivot Tables

Economic data often comes in different formats depending on the source. Two common formats are:

- **Long format**: Each row represents a single observation, with columns for identifiers (e.g., country, year) and values
- **Wide format**: Each row represents an entity (e.g., country), with separate columns for each time period

Converting between these formats is a common task in economic analysis. Let's see how to do this with pandas:

```python
# Example: GDP by country and year in long format
gdp_long = pd.DataFrame({
    'Country': ['USA', 'USA', 'UK', 'UK'],
    'Year': [2020, 2021, 2020, 2021],
    'GDP': [20.9, 21.4, 2.6, 2.7]
})
print("\nLong Format:")
print(gdp_long)
#   Country  Year   GDP
# 0     USA  2020  20.9
# 1     USA  2021  21.4
# 2      UK  2020   2.6
# 3      UK  2021   2.7

# Convert to wide format using pivot
gdp_wide = gdp_long.pivot(
    index='Country',  # Rows
    columns='Year',   # Columns
    values='GDP'      # Values to fill
)
print("\nWide Format:")
print(gdp_wide)
# Year     2020  2021
# Country
# UK        2.6   2.7
# USA      20.9  21.4

# More complex pivot with multiple values and aggregations.
# Add trade balance data:
gdp_trade = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'UK', 'UK', 'UK'],
    'Year': [2020, 2020, 2021, 2020, 2020, 2021],
    'GDP': [20.9, 20.9, 21.4, 2.6, 2.6, 2.7],
    'Trade_Balance': [-0.5, -0.6, -0.7, 0.2, 0.3, 0.1]
})

# Pivot with aggregation functions
gdp_summary = pd.pivot_table(
    gdp_trade,
    index='Country',
    columns='Year',
    values=['GDP', 'Trade_Balance'],
    aggfunc={
        'GDP': 'mean',                    # Average if multiple values
        'Trade_Balance': ['mean', 'sum']  # Multiple aggregations
    }
)
print("\nComplex Pivot Result:")
print(gdp_summary)
# The result has hierarchical columns combining the variable,
# the aggregation function, and the year
```

The pivot operations are particularly useful when:

- Preparing data for time series analysis (wide format often preferred)
- Creating summary tables for research papers
- Computing year-over-year changes
- Comparing multiple metrics across countries or time periods
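Going in the other direction, `pd.melt()` (or the `.melt()` method) converts wide data back to long format. A minimal sketch using the `gdp_wide` table created above:

```python
# Wide back to long: reset the index, then melt the year columns into rows
gdp_long_again = gdp_wide.reset_index().melt(
    id_vars='Country',  # Identifier column(s) to keep
    var_name='Year',    # Name for the former column labels
    value_name='GDP'    # Name for the values
)
print(gdp_long_again)
```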
## Sorting

Sorting data is a fundamental operation in data analysis. pandas provides flexible methods to sort DataFrames by one or multiple columns, or by index values. This is particularly useful when working with economic time series or when preparing data for presentation.

### Basic Sorting

```python
# Sort by a single column
df_sorted = df.sort_values("GDP", ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(["Country", "Year"])

# Sort by index
df_sorted = df.sort_index()
```

### Advanced Sorting Options

You can customize sorting behavior with additional parameters:

```python
# Sort by multiple columns in different directions
df_sorted = df.sort_values(
    ["Country", "GDP"],
    ascending=[True, False]  # Ascending by Country, descending by GDP
)

# Handle missing values (NaN)
df_sorted = df.sort_values(
    "GDP",
    na_position="first"  # Put NaN values at the beginning
)
```

Remember that sorting operations return a new DataFrame by default. To modify the original DataFrame in place, use the `inplace=True` parameter.

## Saving DataFrames

After manipulating your data, you'll often want to save the results for future use or sharing with colleagues. pandas provides several methods to export DataFrames to various file formats, each suited to different needs.

### Basic File Formats

For small to medium datasets, standard formats work well:

```python
# Save to CSV (most common)
df.to_csv("output.csv")

# Save to Excel
df.to_excel("output.xlsx", sheet_name="Sheet1")

# Save to JSON
df.to_json("output.json")
```

### Advanced Export Options

You can customize the export process with various parameters:

```python
# CSV with specific encoding and date format
df.to_csv(
    "output.csv",
    encoding="utf-8",
    date_format="%Y-%m-%d",
    index=False  # Don't save the index as a column
)

# Excel with multiple sheets
with pd.ExcelWriter("output.xlsx") as writer:
    df1.to_excel(
        writer,
        sheet_name="GDP Data"
    )
    df2.to_excel(
        writer,
        sheet_name="Population Data"
    )

# Compressed CSV for large datasets
df.to_csv("output.csv.gz", compression="gzip")   # gzip compression
df.to_csv("output.csv.bz2", compression="bz2")   # bzip2 compression
df.to_csv("output.csv.zip", compression="zip")   # zip compression

# Parquet format - excellent for large datasets
df.to_parquet(
    "output.parquet",
    compression="snappy"  # Fast compression, good for analytics
)

# Parquet with partitioning (e.g., by year and country)
df.to_parquet(
    "data_directory",
    partition_cols=["year", "country"]  # Creates nested directories
)
```

Remember to consider file size, readability, and compatibility when choosing an export format. CSV files are widely compatible but may lose data type information, while Excel files preserve formatting but may have size limitations.

## Further Resources

For more advanced features and detailed documentation:

1. [pandas documentation](https://pandas.pydata.org/docs/)
2. [pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)
3. [pandas API reference](https://pandas.pydata.org/docs/reference/index.html)

These resources provide comprehensive coverage of pandas' capabilities, including advanced features not covered in this introduction.

## Concluding Remarks

pandas is a cornerstone for data handling tasks in Python, especially when working with mixed-type, row-and-column datasets frequently encountered in economics. By combining the power of **pandas** with **NumPy**, you can:

1. **Load and Inspect**: Rapidly read data from files or databases and get immediate insights into your dataset's shape, column names, and data types.
2. **Filter and Subset**: Flexibly select rows and columns using integer-based, label-based, or Boolean indexing, making it easy to isolate subsets for detailed analysis or cleaning.
3. **Handle Missing Data**: Identify, drop, or fill missing values in a systematic way, crucial for preparing data for econometric models or machine learning pipelines.
4. **Aggregate**: Compute descriptive statistics across the entire dataset or by groups. In economics, this is especially valuable for multi-country, multi-industry, or multi-year analyses.
5. **Categorical and Continuous Data**: Distinguish between data types for appropriate transformations and quick statistical summaries.

With these skills, you can confidently tackle most routine data management and exploratory tasks in Python. As your projects scale up or require more sophisticated operations, pandas provides additional features (merging, pivoting, time-series functionality, and more) that integrate seamlessly with the broader Python ecosystem for data analysis, econometrics, and machine learning.